 task completion rate


AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Wu, Yixin, Wen, Rui, Cui, Chi, Backes, Michael, Zhang, Yang

arXiv.org Artificial Intelligence

Inference attacks have been widely studied and offer a systematic risk assessment of ML services; however, their implementation and the attack parameters for optimal estimation are challenging for non-experts. The emergence of advanced large language models presents a promising yet largely unexplored opportunity to develop autonomous agents as inference attack experts, helping address this challenge. In this paper, we propose AttackPilot, an autonomous agent capable of independently conducting inference attacks without human intervention. We evaluate it on 20 target services. The evaluation shows that our agent, using GPT-4o, achieves a 100.0% task completion rate and near-expert attack performance, with an average token cost of only $0.627 per run. The agent can also be powered by many other representative LLMs and can adaptively optimize its strategy under service constraints. We further perform trace analysis, demonstrating that design choices, such as a multi-agent framework and task-specific action spaces, effectively mitigate errors such as bad plans, inability to follow instructions, task context loss, and hallucinations. We anticipate that such agents could empower non-expert ML service providers, auditors, or regulators to systematically assess the risks of ML services without requiring deep domain expertise.


Using Vision Language Models as Closed-Loop Symbolic Planners for Robotic Applications: A Control-Theoretic Perspective

Wang, Hao, Karnik, Sathwik, Lim, Bea, Bansal, Somil

arXiv.org Artificial Intelligence

Large Language Models (LLMs) and Vision Language Models (VLMs) have been widely used for embodied symbolic planning. Yet, how to effectively use these models for closed-loop symbolic planning remains largely unexplored. Because they operate as black boxes, LLMs and VLMs can produce unpredictable or costly errors, making their use in high-level robotic planning especially challenging. In this work, we investigate how to use VLMs as closed-loop symbolic planners for robotic applications from a control-theoretic perspective. Concretely, we study how the control horizon and warm-starting impact the performance of VLM symbolic planners. We design and conduct controlled experiments to gain insights that are broadly applicable to utilizing VLMs as closed-loop symbolic planners, and we discuss recommendations that can help improve the performance of VLM symbolic planners. The project website can be found here.


TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks

Xu, Hanwen, Huang, Xuyao, Liu, Yuzhe, Yu, Kai, Deng, Zhijie

arXiv.org Artificial Intelligence

Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents to solve such problems, which demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, and calendar checking, and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. The empirical studies on popular closed-source and open-source LLMs indicate that most models can perform reasonable tool planning, but differ in scheduling. For example, GLM-4.5 achieves a leading task completion rate of 64.72% through extensive sequential tool calls, but consequently suffers from a significantly long execution time. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Considering that reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in task completion rate using only 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
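
To make the scheduling side of this concrete, the following is a minimal sketch, not the TPS-Bench implementation. It assumes each subtask declares which other subtasks it depends on, and groups subtasks into batches whose members could be issued as parallel tool calls. The subtask names and the `parallel_batches` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group subtasks into batches that can run in parallel.

    dependencies maps each subtask to the set of subtasks it must wait for
    (an assumed input format, chosen for illustration).
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Every subtask returned here has all of its prerequisites finished.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical compounding task: two independent lookups feed one final step.
example = {
    "search_venue": set(),
    "check_calendar": set(),
    "plan_route": {"search_venue", "check_calendar"},
}
```

A purely sequential agent would issue three calls one after another, while this grouping needs only two rounds; the number of batches is a rough proxy for execution time under the assumption that calls in one batch overlap fully.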


Collaborative Scheduling of Time-dependent UAVs, Vehicles and Workers for Crowdsensing in Disaster Response

Han, Lei, Zhang, Jinhao, Liu, Jinhui, Yu, Zhiyong, Wang, Liang, Wang, Quan, Yu, Zhiwen

arXiv.org Artificial Intelligence

Frequent natural disasters cause significant losses to human society, and timely, efficient collection of post-disaster environmental information is the foundation for effective rescue operations. Due to the extreme complexity of post-disaster environments, existing sensing technologies such as mobile crowdsensing suffer from weak environmental adaptability, insufficient professional sensing capabilities, and poor practicality of sensing solutions. Therefore, this paper explores a heterogeneous multi-agent online collaborative scheduling algorithm, HoCs-MPQ, to achieve efficient collection of post-disaster environmental information. HoCs-MPQ models collaboration and conflict relationships among multiple elements through weighted undirected graph construction, and iteratively solves the maximum weight independent set based on multi-priority queues, ultimately achieving collaborative sensing scheduling of time-dependent UAVs, vehicles, and workers. Specifically, (1) HoCs-MPQ constructs weighted undirected graph nodes based on collaborative relationships among multiple elements and quantifies their weights, then models the weighted undirected graph based on conflict relationships between nodes; (2) HoCs-MPQ solves the maximum weight independent set based on iterated local search, and accelerates the solution process using multi-priority queues. Finally, we conducted detailed experiments based on extensive real-world and simulated data. The experiments show that, compared to baseline methods (e.g., HoCs-GREEDY, HoCs-K-WTA, HoCs-MADL, and HoCs-MARL), HoCs-MPQ improves task completion rates by an average of 54.13%, 23.82%, 14.12%, and 12.89%, respectively, with computation time for single online autonomous scheduling decisions not exceeding 3 seconds.
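
The maximum weight independent set formulation can be illustrated with a small sketch. This is a plain greedy heuristic driven by a single priority queue, swapped in for illustration; it is not the iterated local search with multi-priority queues that the abstract describes. The node names, weights, and the `greedy_mwis` helper are hypothetical.

```python
import heapq


def greedy_mwis(weights, conflicts):
    """Greedy heuristic for a maximum weight independent set.

    weights:   dict mapping node -> weight (e.g. value of one candidate
               UAV/vehicle/worker assignment; an assumed interpretation).
    conflicts: iterable of (u, v) edges between assignments that cannot
               both be selected.
    """
    neighbors = {node: set() for node in weights}
    for u, v in conflicts:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # heapq is a min-heap, so weights are negated to pop the heaviest first.
    heap = [(-weight, str(node), node) for node, weight in weights.items()]
    heapq.heapify(heap)

    chosen, blocked = [], set()
    while heap:
        _, _, node = heapq.heappop(heap)
        if node in blocked:
            continue
        chosen.append(node)
        # Selecting a node rules out the node itself and all its conflicts.
        blocked.add(node)
        blocked.update(neighbors[node])
    return chosen


# Hypothetical conflict graph: a path a - b - c - d.
weights = {"a": 5, "b": 4, "c": 3, "d": 1}
conflicts = [("a", "b"), ("b", "c"), ("c", "d")]
selected = greedy_mwis(weights, conflicts)
```

On this example the heuristic selects `a` and `c`, which happens to be optimal here; in general a greedy pass gives no optimality guarantee, which is one reason a local search refinement is attractive.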


LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences

Yuan, Liangqi, Han, Dong-Jun, Brinton, Christopher G., Brunswicker, Sabine

arXiv.org Artificial Intelligence

The rise of large language models (LLMs) has made natural language-driven route planning an emerging research area that encompasses rich user objectives. Current research exhibits two distinct approaches: direct route planning using LLM-as-Agent and graph-based searching strategies. However, LLMs in the former approach struggle to handle extensive map data, while the latter shows limited capability in understanding natural language preferences. Additionally, a more critical challenge arises from the highly heterogeneous and unpredictable spatio-temporal distribution of users across the globe. In this paper, we introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an LLM-as-Parser to comprehend natural language, identify tasks, extract user preferences, and recognize task dependencies, coupled with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for optimal route finding. Our multi-objective optimization approach adaptively tunes objective weights to maximize points of interest (POI) quality and task completion rate while minimizing route distance, subject to three key constraints: user time limits, POI opening hours, and task dependencies. We conduct extensive experiments using 1,000 routing prompts sampled with varying complexity across 14 countries and 27 cities worldwide. The results demonstrate that our approach achieves superior performance with guarantees across multiple constraints.


Securing AI Agents with Information-Flow Control

Costa, Manuel, Köpf, Boris, Kolluri, Aashish, Paverd, Andrew, Russinovich, Mark, Salem, Ahmed, Tople, Shruti, Wutschitz, Lukas, Zanella-Béguelin, Santiago

arXiv.org Artificial Intelligence

As AI agents become increasingly autonomous and capable, ensuring their security against vulnerabilities such as prompt injection becomes critical. This paper explores the use of information-flow control (IFC) to provide security guarantees for AI agents. We present a formal model to reason about the security and expressiveness of agent planners. Using this model, we characterize the class of properties enforceable by dynamic taint-tracking and construct a taxonomy of tasks to evaluate security and utility trade-offs of planner designs. Informed by this exploration, we present Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces security policies, and introduces novel primitives for selectively hiding information. Its evaluation in AgentDojo demonstrates that this approach enables us to complete a broad range of tasks with security guarantees. A tutorial to walk readers through the concepts introduced in the paper can be found at https://github.com/microsoft/fides
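
The idea of tracking confidentiality and integrity labels can be sketched in a few lines. The code below is an illustrative toy under simplifying assumptions (two boolean labels, one hard-coded policy), not the Fides API; `Labeled`, `combine`, and `check_tool_call` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A value carrying an integrity label and a confidentiality label."""
    value: str
    trusted: bool  # integrity: did this come from a trusted source?
    secret: bool   # confidentiality: must this stay out of public sinks?


def combine(a, b):
    """Taint propagation: the result is trusted only if both inputs are,
    and secret if either input is."""
    return Labeled(
        value=a.value + " " + b.value,
        trusted=a.trusted and b.trusted,
        secret=a.secret or b.secret,
    )


def check_tool_call(argument, sink_is_public):
    """Deterministic policy check applied before a tool call (assumed policy)."""
    if not argument.trusted:
        raise PermissionError("untrusted data may not drive a tool call")
    if argument.secret and sink_is_public:
        raise PermissionError("secret data may not flow to a public sink")
    return True


user_request = Labeled("send the weekly report", trusted=True, secret=False)
web_content = Labeled("ignore previous instructions", trusted=False, secret=False)
mixed = combine(user_request, web_content)
```

Because the label of `mixed` is computed rather than inferred by a model, the check rejects it regardless of what the injected text says, which is the sense in which such enforcement is deterministic.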


MobiAgent: A Systematic Framework for Customizable Mobile Agents

Zhang, Cheng, Feng, Erhu, Zhao, Xi, Zhao, Yisheng, Gong, Wangbo, Sun, Jiahui, Du, Dong, Hua, Zhichao, Xia, Yubin, Chen, Haibo

arXiv.org Artificial Intelligence

With the rapid advancement of Vision-Language Models (VLMs), GUI-based mobile agents have emerged as a key development direction for intelligent mobile systems. However, existing agent models continue to face significant challenges in real-world task execution, particularly in terms of accuracy and efficiency. To address these limitations, we propose MobiAgent, a comprehensive mobile agent system comprising three core components: the MobiMind-series agent models, the AgentRR acceleration framework, and the MobiFlow benchmarking suite. Furthermore, recognizing that the capabilities of current mobile agents are still limited by the availability of high-quality data, we have developed an AI-assisted agile data collection pipeline that significantly reduces the cost of manual annotation. Compared to both general-purpose LLMs and specialized GUI agent models, MobiAgent achieves state-of-the-art performance in real-world mobile scenarios.


A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing

Wang, Xin, Li, Xiao Huan, Wang, Xun

arXiv.org Artificial Intelligence

The integration of the Industrial Internet of Things (IIoT) with Artificial Intelligence-Generated Content (AIGC) offers new opportunities for smart manufacturing, but it also introduces challenges related to computation-intensive tasks and low-latency demands. Traditional generative models based on cloud computing struggle to meet the real-time requirements of AIGC tasks in IIoT environments, whereas edge computing can effectively reduce latency through task offloading. However, the dynamic nature of AIGC tasks, model switching delays, and resource constraints impose higher demands on edge computing environments. To address these challenges, this paper proposes an AIGC task offloading framework tailored for IIoT edge computing environments, considering the latency and energy consumption caused by AIGC model switching for the first time. IIoT devices, acting as multiple agents, collaboratively offload their dynamic AIGC tasks to the most appropriate edge servers deployed with different generative models. A model-aware AIGC task offloading algorithm based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG-MATO) is devised to minimize latency and energy consumption. Experimental results show that MADDPG-MATO outperforms baseline algorithms, achieving an average reduction of 6.98% in latency and 7.12% in energy consumption, and a 3.72% increase in task completion rate, across four sets of experiments with model numbers ranging from 3 to 6. These results demonstrate that the proposed algorithm is robust and efficient in dynamic, high-load IIoT environments.


Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents

Zhang, Enhao, Zhu, Erkang, Bansal, Gagan, Fourney, Adam, Mozannar, Hussein, Gerrits, Jack

arXiv.org Artificial Intelligence

Large language model (LLM)-based multi-agent systems have demonstrated remarkable promise for tackling complex tasks by breaking them down into subtasks that are iteratively planned, executed, observed, and refined. Despite their effectiveness, these systems often incur high latency because real-world problems frequently demand multiple iterative cycles of reasoning steps. To address this challenge, we propose M1-Parallel, a framework that runs multiple multi-agent teams in parallel to uncover distinct solution paths. By leveraging an event-driven communication model with asynchronous messaging, M1-Parallel efficiently capitalizes on the inherent diversity of valid plans to either reduce end-to-end latency or boost task completion rates. Our experiments on complex tasks show that M1-Parallel with early termination achieves up to 2.2× speedup while preserving accuracy, and that M1-Parallel with aggregation yields higher task completion rates. We further investigate strategies aimed at encouraging diverse execution plans but observe no additional performance gains over repeated sampling. Overall, these findings underscore the potential of parallel plan execution for optimizing multi-agent systems for real-world, high-complexity reasoning tasks.
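
The two modes mentioned above, early termination and aggregation, can be sketched with Python's `asyncio`. This is a simplified illustration rather than the M1-Parallel implementation: `run_team` is a hypothetical stand-in that sleeps instead of reasoning, and majority voting is assumed as the aggregation rule.

```python
import asyncio
from collections import Counter


async def run_team(name, delay, answer):
    """Stand-in for one multi-agent team; the delay models its latency."""
    await asyncio.sleep(delay)
    return name, answer


async def early_termination(teams):
    """Return the first team's result and cancel the remaining teams."""
    tasks = [asyncio.create_task(run_team(*team)) for team in teams]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for the cancellations to be processed before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


async def aggregation(teams):
    """Wait for every team and return the most common answer (assumed rule)."""
    results = await asyncio.gather(*(run_team(*team) for team in teams))
    votes = Counter(answer for _, answer in results)
    return votes.most_common(1)[0][0]


# Hypothetical teams: (name, simulated latency in seconds, answer).
teams = [("team_a", 0.2, "42"), ("team_b", 0.01, "42"), ("team_c", 0.3, "41")]
```

Calling `asyncio.run(early_termination(teams))` returns as soon as the fastest team finishes, while `asyncio.run(aggregation(teams))` pays the latency of the slowest team in exchange for a vote across all answers.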


Hierarchical Task Offloading for UAV-Assisted Vehicular Edge Computing via Deep Reinforcement Learning

Li, Hongbao, Jia, Ziye, He, Sijie, Guo, Kun, Wu, Qihui

arXiv.org Artificial Intelligence

With the rise of compute-intensive and delay-sensitive applications in vehicular networks, unmanned aerial vehicles (UAVs) have emerged as a promising complement for vehicular edge computing due to their high mobility and flexible deployment. However, existing UAV-assisted offloading strategies are insufficient in coordinating heterogeneous computing resources and adapting to dynamic network conditions. Hence, this paper proposes a dual-layer UAV-assisted edge computing architecture based on partial offloading, combining the relay capability of high-altitude UAVs with the computing support of low-altitude UAVs. The proposed architecture enables efficient integration and coordination of heterogeneous resources. A joint optimization problem is formulated to minimize the system delay and energy consumption while ensuring the task completion rate. To solve the high-dimensional decision problem, we reformulate the problem as a Markov decision process and propose a hierarchical offloading scheme based on the soft actor-critic algorithm. The method decouples global and local decisions: the global decisions integrate offloading ratios and trajectory planning into continuous actions, while the local scheduling is handled by a priority-based mechanism. Simulations demonstrate that the proposed approach outperforms several baselines in task completion rate, system efficiency, and convergence speed, showing strong robustness and applicability in dynamic vehicular environments.